Final Report for Proposal No. 56268-NS-II
Meta-Learning Assistants Using a Novel Characterization of Data Landscapes

Ricardo  Vilalta 


Summary  of  Project  Activities 

Our  project  focused  on  the  mathematical  foundations  needed  to  build  meta-learning  assistants. 
The overall goal is to understand how to acquire and exploit knowledge about learning (i.e.,
meta-knowledge) in order to understand and improve the performance of learning algorithms.
To that end, our work focused on the following research problem: when we face a supervised
learning task, how can we decide whether a single complex model or a combination of simple
models is the best strategy? Our results show that a combination of simple models is often the
best choice, as a minimal increase in model complexity is equivalent to tens of simple models.

Figure 1 shows a diagram illustrating our main ideas. Traditional approaches to model
selection vary complexity by jumping between model families F_i; every single model in the new
family is able to create more flexible decision boundaries than any single model in the
first family. Alternatively, complexity can vary by combining multiple models into a composite
model (while fixing the complexity of each single model in the first family); every model in the
new family F_i^k is the result of combining k models from the first family F_i. The new models
are also more complex, but the added complexity comes from the composite approach. The
questions are: how do these two approaches compare? How much does complexity increase with
each approach? When combining k models, how far can k increase before complexity exceeds that
of the traditional approach of invoking single complex models? By answering these questions we
open the possibility of including both approaches in the same model selection strategy, while
expanding our understanding of learning-algorithm design.

Our  Theoretical  Analysis 

In  what  follows  I  provide  a  detailed  description  of  our  theoretical  analysis  (a  full  description  can 
be  found  in  our  conference  paper  at  ICAART  (Vilalta  et  al.,  2010)).  We  showed  the  conditions 
under  which  combining  multiple  local  models  is  expected  to  be  beneficial.  In  essence  we  wish  to 
compare a composite model Mc to a basic global model Mb. Mc is the combination of multiple
models. We assume Mb has VC-dimension hb and Mc has VC-dimension hc, which comes from
the combination of k models, each of VC-dimension at most h, where we assume h < hb.

Figure 1: Two types of model selection. Top: Complexity is increased by looking at families of
models F_i with increased flexibility in the decision boundaries. Bottom: Complexity is increased
by combining k models while fixing the complexity of each model. F_i^k stands for the combination
of k models of family F_i. If we could compare both approaches, as in this example, we could
say that model family F_1^3 is less complex than family F_2, which in turn is less complex than
family F_1^4.

The question we address is the following: how many models of VC-dimension at most h can
Mc comprise and still improve on generalization accuracy over Mb, assuming both models have
the same empirical error? The question refers to the maximum value of k that still gives Mc an
advantage over Mb. To proceed we look at the VC-dimension hc, which in essence is
the VC-dimension of k-fold unions or intersections. It is an open problem to determine the
VC-dimension of a family of k-fold unions (Reyzin, 2006; Blumer et al., 1989; Eisenstat and
Angluin, 2007); recent work, however, shows that such a family of models has a lower bound of
(8/5)kh and an upper bound of 2kh log2(3k) (it has been shown that O(nk log2 k) is a tight bound
(Eisenstat and Angluin, 2007)). We begin our study with the optimistic lower bound, and assume
hc to be (8/5)kh. To solve the question above we equate Vapnik's guaranteed risk
for both Mc and Mb:
where our goal is now simply to solve for k. After some algebraic manipulation we get the
following:

$$ c_1 k - k \ln k = c_2 \qquad (2) $$

where c1 and c2 are constants.
Figure 2: Left: A comparison of a compound model using k (k') support vector machines with
polynomial kernels of degree one vs a simple support vector machine with a polynomial kernel of
degree two; Right: same comparison except the simple support vector machine has a polynomial
kernel of degree four. The degree of the polynomial kernel makes little difference in the results.


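The algebra leading to equation 2 is not reproduced in this report. As a sketch only, assuming Vapnik's bound in the form given by Burges (1998), with N training examples and confidence 1 - eta, and writing the lower bound on the composite VC-dimension as h_c = a k h (a is a placeholder symbol for the leading constant of that bound), the steps could run as follows. The resulting expressions for c_1 and c_2 are a reconstruction under these assumptions, not the authors' stated definitions.

```latex
% Guaranteed risk (form assumed from Burges, 1998):
R(M) \le R_{emp}(M) + \sqrt{\frac{h\left(\ln\frac{2N}{h} + 1\right) - \ln\frac{\eta}{4}}{N}}

% With equal empirical error and equal \eta, the two bounds coincide when
h_c\left(\ln\frac{2N}{h_c} + 1\right) = h_b\left(\ln\frac{2N}{h_b} + 1\right)

% Substitute h_c = a k h and divide both sides by a h:
k\left(\ln\frac{2N}{a h} + 1\right) - k \ln k = \frac{h_b}{a h}\left(\ln\frac{2N}{h_b} + 1\right)

% This has the form c_1 k - k \ln k = c_2 with
c_1 = \ln\frac{2N}{a h} + 1, \qquad c_2 = \frac{h_b}{a h}\left(\ln\frac{2N}{h_b} + 1\right)
```
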
Equation  2  can  be  formulated  as  a  transcendental  algebraic  equation.  We  can  transform  the 
equation  as  follows: 

$$ -c_2 k^{-1} e^{-c_2 k^{-1}} = -c_2 e^{-c_1} \qquad (5) $$

To solve for k we can use Lambert's W function:

$$ k = \frac{-c_2}{W(-c_2 e^{-c_1})} \qquad (6) $$

where W can be solved using a numeric approximation.


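As a minimal sketch of the numeric approximation mentioned above, the principal branch of W can be computed with Newton's iteration and plugged into equation 6. The function names, the starting guess, and the sample values of c1 and c2 are illustrative assumptions, not taken from the report. For arguments in (-1/e, 0) the equation has a second real root on the other real branch of W, which this sketch does not compute.

```python
import math

def lambert_w0(x, tol=1e-12, max_iter=200):
    """Principal branch W0 of Lambert's W: the w >= -1 with w * exp(w) == x."""
    if x < -1.0 / math.e:
        raise ValueError("W0 is real only for x >= -1/e")
    # log1p(x) lies to the right of the root, where w*exp(w) is increasing
    # and convex, so Newton's iteration approaches the root monotonically.
    w = math.log1p(x)
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) <= tol * (1.0 + abs(w)):
            break
    return w

def solve_k(c1, c2):
    """Root of c1*k - k*ln(k) = c2 given by k = -c2 / W0(-c2 * exp(-c1))."""
    return -c2 / lambert_w0(-c2 * math.exp(-c1))

c1, c2 = 5.0, 8.0  # illustrative constants only
k = solve_k(c1, c2)
residual = c1 * k - k * math.log(k) - c2  # should be close to zero
print(k, residual)
```

Substituting the returned k back into equation 2, as the last lines do, is a quick way to confirm that the iteration has converged.
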
Figure 3: Left: A comparison of a compound model using k (k') support vector machines
with polynomial kernels of degree two vs a simple support vector machine with a polynomial
kernel of degree three; Right: same comparison except the simple support vector machine has a
polynomial kernel of degree five. The degree of the polynomial kernel makes little difference in
the results.

A similar analysis can be done using the upper bound of hc = 2k'h log2(3k'), where we use k'
to differentiate from the k used with the lower bound. After some algebraic manipulation we
get the following equation:

$$ c_3 v - v \ln v = c_4 \qquad (7) $$

where v = k' ln(3k'), and c3 and c4 are constants (only slightly different than before).
Since equations 2 and 7 have the same form, v has the same solution as k (equation 6):

$$ v = \frac{-c_4}{W(-c_4 e^{-c_3})} \qquad (10) $$

We can then do the substitution back to k' to obtain the following, where c5 denotes the
right-hand side of equation 10:

$$ k' \ln 3k' = c_5 \qquad (11) $$

$$ c_5 (k')^{-1} e^{c_5 (k')^{-1}} = 3 c_5 \qquad (12) $$

It is now possible to solve for k':

$$ k' = \frac{c_5}{W(3 c_5)} \qquad (13) $$

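The two-step solution for k' (solve for v, then substitute back) can be sketched in the same spirit. The helper names and the sample constants c3 and c4 are illustrative assumptions; v is taken to be k' ln(3k'), as equation 11 requires.

```python
import math

def lambert_w0(x, tol=1e-12, max_iter=200):
    """Principal branch W0 of Lambert's W, computed by Newton's iteration."""
    if x < -1.0 / math.e:
        raise ValueError("W0 is real only for x >= -1/e")
    w = math.log1p(x)  # starting guess to the right of the root
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) <= tol * (1.0 + abs(w)):
            break
    return w

def solve_k_prime(c3, c4):
    """Solve c3*v - v*ln(v) = c4 for v, then k'*ln(3k') = v for k'."""
    c5 = -c4 / lambert_w0(-c4 * math.exp(-c3))  # same form as equation 6
    return c5 / lambert_w0(3.0 * c5)            # k' = c5 / W(3 c5)

c3, c4 = 5.0, 4.7  # illustrative constants only
k_prime = solve_k_prime(c3, c4)
v = k_prime * math.log(3.0 * k_prime)
print(k_prime, c3 * v - v * math.log(v) - c4)  # second value should be near zero
```
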
To summarize, we have shown how to express the number k (and k') of models, each with
VC-dimension h, that can be combined in a union such that the resulting compound model exhibits
the same guaranteed risk as a single model with VC-dimension hb (we assume of course that h < hb).
To clarify, we handle two bounds, k and k', because of our uncertainty in the VC-dimension of
model unions. In principle we know there is a k'' that stands as the exact bound, below which
Mc retains an advantage over Mb.

We  can  now  study  the  effect  on  k  (and  k')  as  we  vary  parameters  such  as  the  size  of  the 
training  set,  or  the  VC-dimension  of  the  models  in  the  composite  model  Mc  (as  compared  to 
the  global  model  Mb).  Figures  2  and  3  show  plots  on  how  the  number  of  model  unions  varies 
with  different  values  of  N.  In  each  case  we  take  the  compound  model  as  the  union  of  k  (and 
k')  support  vector  machines,  where  the  simple  global  model  is  a  single  support  vector  machine. 
We assume the use of polynomial kernels where the VC-dimension of each model is defined
as (Burges, 1998):

$$ \binom{n + p - 1}{p} + 1 \qquad (14) $$


where  n  is  the  dimensionality  of  the  input  space  and  p  is  the  degree  of  the  polynomial.  In 
Figure  2  we  assume  a  compound  model  with  polynomial  kernels  of  degree  p  =  1.  The  global 
model  varies  from  a  polynomial  degree  p  =  2  (Figure  2-left)  to  a  polynomial  degree  p  =  4 
(Figure 2-right). In all cases we assume n = 5. It is clearly observed that the value of k (k')
increases linearly with N. As expected, k' corresponds to a less steep line, as the upper
bound on the VC-dimension lowers the number of models we can place in the composite model
while still generating less variance than the single model. In addition, a higher difference in
VC-dimension (Figure 2-right) shows almost no difference in the shape of k (k') for different
values of N. The right y-axis on each graph is the log2 of the values on the left y-axis; it is simply
an indicator of how many local models we could arrange in a hierarchical structure (assuming a
binary tree) while still generating less variance than the global model. We observe that for large
values of N (e.g., N > 500), large hierarchies can be employed with little effect on the variance
component.

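For reference, the VC-dimensions that the polynomial-kernel formula (n + p - 1 choose p, plus one) yields for the settings used in Figures 2 and 3 (n = 5, p from 1 to 5) can be tabulated with a few lines of Python. The function name is an illustrative choice.

```python
from math import comb

def poly_kernel_vc_dim(n, p):
    """C(n + p - 1, p) + 1 for a degree-p polynomial kernel on n inputs."""
    return comb(n + p - 1, p) + 1

n = 5  # input dimensionality assumed in the figures
for p in range(1, 6):
    print(p, poly_kernel_vc_dim(n, p))
```
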
Figure  3  assumes  a  compound  model  with  polynomial  kernels  of  degree  p  =  2.  The  single 
model  varies  from  a  polynomial  degree  p  =  3  (Figure  3-left)  to  a  polynomial  degree  p  =  5 
(Figure  3-right).  The  same  effect  is  observed  as  before  except  under  a  different  scale.  In  all 
graphs we observe a large advantage gained by the combination of many low-complexity models
as compared to a single model exhibiting higher complexity. The difference grows linearly with N
and is considerable for N > 500.

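The near-linear dependence on N can also be checked numerically without Lambert's W. The sketch below assumes that the part of the guaranteed risk that depends on the VC-dimension has the form h(ln(2N/h) + 1), takes hc = a k h with a = 8/5 as the lower bound, and finds the root by bisection. The function names, the value of a, and the bracketing interval are assumptions made for illustration; under them, the root that grows with N lies at hc between 2N and 2eN, while the other root is simply hc = hb.

```python
import math

def capacity_term(h, N):
    """h * (ln(2N/h) + 1): the part of the assumed bound that depends on h."""
    return h * (math.log(2.0 * N / h) + 1.0)

def max_models(N, h, h_b, a=8.0 / 5.0, iters=200):
    """Larger root k of capacity_term(a*k*h) == capacity_term(h_b), by bisection."""
    target = capacity_term(h_b, N)
    # On [2N, 2eN] the capacity term decreases from 2N down to 0.
    lo, hi = 2.0 * N, 2.0 * math.e * N
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if capacity_term(mid, N) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi) / (a * h)

h, h_b = 6, 16  # degree-1 vs degree-2 polynomial kernels with n = 5
for N in (250, 500, 1000, 2000):
    k = max_models(N, h, h_b)
    print(N, round(k, 1), round(math.log2(k), 2))
```

The last printed column is the log2 of k, the quantity shown on the right y-axis of the figures.
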
Conclusions 

Our study shows the advantage that comes when a piece-wise model fitting approach is used in
classification. This is justified by the difference between the rate at which complexity grows
when the number of boundaries per class is increased (composite model) and the rate at which it
grows when the capacity of a single global learning algorithm is increased (classical approach).
The former enables us to increase the model complexity in finer steps.

Our  future  goal  is  to  use  these  results  in  building  a  model  for  the  classification  of  classes 
and  sub-classes  in  hierarchical  learning  problems.  The  key  idea  is  to  try  to  combine  simple 
models as comprehensively as possible before any attempt is made to apply complex models for
classification.